This document introduces how to perform declustering with the R package texmex.
library(texmex)
## Loading required package: mvtnorm
## Loading required package: ggplot2
library(ggplot2)
library(gridExtra)
library(pdp) #for grid.arrange fct
palette(c("black","purple","cyan","orange"))
set.seed(20120118)
Here we summarise the method of Ferro and Segers (2003).
\(\{\xi_n : n=1,2,\ldots \}\): a strict stationary sequence of random variables, with marginal distribution function \(F\) and tail function \(\bar{F}\).
\(F\) may have a finite upper end point or it may be a distribution with an infinite upper tail.
\(M_n\): maximum of the first \(n\geq 1\) observations of the process.
We are interested in the behaviour of exceedances of \(\{\xi_n : n=1,2,\ldots \}\) above a threshold.
The extremal index is a measure of the extent to which threshold exceedances from the sequence \(\{\xi_n : n=1,2,\ldots \}\) cluster in the limit as the threshold tends to the upper endpoint of the support of \(F\).
Definition 1 (Extremal index) (Leadbetter 1983)
The sequence \(\{\xi_n : n=1,2,\ldots \}\) has an extremal index \(\theta \in [0,1]\) if for each \(\tau > 0\), there is a sequence \(\{u_n: n=1,2,\ldots \}\) for which both
\[ \begin{align*} n\bar{F}(u_n) &\rightarrow \tau\\ P(M_n \leq u_n) &\rightarrow \exp(-\theta \tau) \end{align*} \]
as \(n \rightarrow \infty\).
The first condition ensures that \(u_n\) increases at an appropriate rate to reveal the dependence structure in the process.
The second shows that the growth of process maxima is mitigated by any dependence within the sequence.
Values of \(\theta < 1\) indicate a tendency of threshold exceedances to cluster together as the threshold approaches its limit, whereas if \(\theta = 1\), then threshold exceedances occur in isolation in the limit.
The extremal index can be interpreted through its reciprocal: \(\frac{1}{\theta}\) is the limiting mean size of clusters of exceedances of the threshold \(u\) (Dey and Yan 2015).
inter-cluster times: consecutive exceedances occur in different but adjacent clusters
intra-cluster times: consecutive exceedances are adjacent observations occurring in the same cluster
Inter-cluster times are necessarily larger in distribution than intra-cluster times.
Ferro and Segers (2003) proposed the intervals estimator. Let \(T(u)\) be a random variable having the same distribution as the interexceedance times associated with threshold \(u\): \[\min \{n\geq 1: \xi_{n+1}>u \} \text{ given that } \xi_1 >u.\]
Ferro and Segers (2003) show that as the threshold \(u\) tends to the upper limit of the support of \(F\),
\[\begin{equation} \bar{F}(u) T(u) \stackrel{D}{\rightarrow} T_{\theta}, \tag{1} \end{equation}\]
where \(\stackrel{D}{\rightarrow}\) denotes convergence in distribution and the random variable \(T_{\theta}\) follows the mixture distribution: \[(1-\theta)\epsilon_{0} + \theta \mu_{\theta}.\]
Here
\(\epsilon_{0}\) is the degenerate distribution with point mass at 0,
\(\mu_{\theta}\) is the Exponential distribution with mean \(\frac{1}{\theta}\).
The extremal index has a dual purpose here:
\(\theta\) is the limiting proportion of non-zero interexceedance times (the proportion of interexceedance times which are also inter-cluster times)
it is also 1/(mean of the non-zero interexceedance times)
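These two limiting properties are easy to check by simulation; the following sketch (not texmex code) draws from the mixture with \(\theta = 0.5\):

```r
# Simulate from the limiting mixture (1 - theta)*eps_0 + theta*Exp(theta)
# for theta = 0.5; an illustrative sketch, not part of texmex.
set.seed(1)
theta <- 0.5
n <- 1e5
t_theta <- ifelse(runif(n) < theta, rexp(n, rate = theta), 0)
mean(t_theta > 0)               # proportion of non-zero times, approx. theta
1 / mean(t_theta[t_theta > 0])  # reciprocal of mean non-zero time, also approx. theta
```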
These properties are exploited to obtain a moment estimator of the extremal index \(\theta\), the intervals estimator.
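One common form of this moment estimator can be sketched as follows (texmex's internal implementation may differ in detail):

```r
# Sketch of the Ferro-Segers intervals estimator; texmex's internal
# implementation may differ. x: data series, u: threshold.
intervals_estimator <- function(x, u) {
  exc <- which(x > u)
  T <- diff(exc)  # interexceedance times; length(T) = N - 1
  if (max(T) <= 2) {
    est <- 2 * sum(T)^2 / (length(T) * sum(T^2))
  } else {
    est <- 2 * sum(T - 1)^2 / (length(T) * sum((T - 1) * (T - 2)))
  }
  min(1, est)
}

# For an independent series the estimate should be close to 1:
set.seed(2)
intervals_estimator(rnorm(10000), u = qnorm(0.95))
```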
Under the Ferro and Segers (2003) model, the extremal index \(\theta\) arises as the proportion of interexceedance times which are also inter-cluster times. This means that interexceedance times can be easily categorised into these two groups as a natural consequence of this model. Assuming that we have observed \(N\) threshold exceedances, the \(\lfloor \theta N \rfloor\) largest interexceedance times are assumed to be approximately independent inter-cluster times, and the remaining interexceedance times are treated as intra-cluster times.
In practice, the estimated value of \(\theta\) is used to identify the critical cut-off interexceedance time that distinguishes inter- from intra-cluster times.
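A minimal sketch of this categorisation might look as follows (ties between gap lengths are ignored here; texmex handles such details more carefully):

```r
# Split interexceedance times T into inter- and intra-cluster times using
# theta_hat; a simplified sketch that ignores ties between gap lengths.
split_gaps <- function(T, theta_hat) {
  N <- length(T) + 1          # number of threshold exceedances
  k <- floor(theta_hat * N)   # number of inter-cluster gaps
  cutoff <- sort(T, decreasing = TRUE)[k]
  list(inter = T[T >= cutoff], intra = T[T < cutoff])
}

split_gaps(c(1, 1, 9, 2, 14, 1), theta_hat = 0.4)
# the two largest gaps (9 and 14) are classified as inter-cluster times
```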
Once clusters have been identified, the cluster characteristics can be examined, or the original threshold exceedances can be thinned to give approximately independent cluster maxima. Inference on these cluster maxima can then be carried out – such inference is much more straightforward than inference on the original dependent sequence, although arguably this approach is wasteful of information as it discards all but the largest observation in each cluster.
The uncertainty in the estimation of \(\theta\) and in the subsequent cluster identification is estimated using a bootstrap algorithm. This algorithm maintains the within-cluster dependence structure, and exploits the independence between clusters by resampling clusters and inter-cluster times rather than individual observations from the original dependent sequence. More specifically:
resample with replacement from the set of \(C-1\) observed inter-cluster times;
resample with replacement from the set of \(C\) observed clusters (obtaining both intra-cluster exceedance times and the sizes of associated threshold excesses);
interpose the interexceedance times and clusters to obtain a bootstrap sample of the process of threshold excesses and their times of occurrence;
estimate \(\theta\) and obtain \(N\) for the bootstrap process and decluster using these values;
estimate cluster characteristics or find cluster maxima and carry out inference on these.
The resulting estimates of \(\theta\) and any other parameters associated with cluster characteristics represent a bootstrap estimate of the sampling distribution of these estimators.
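The resampling steps can be sketched as follows; this is an illustrative skeleton, not texmex's implementation, and `inter_times` and `clusters` are hypothetical inputs holding the observed inter-cluster times and per-cluster records:

```r
# One bootstrap replicate of the resampling steps above (illustrative only).
# inter_times: the C-1 observed inter-cluster times;
# clusters:    a hypothetical list of C per-cluster records, each holding
#              intra-cluster times and threshold excess sizes.
boot_replicate <- function(inter_times, clusters) {
  b_gaps  <- sample(inter_times, size = length(inter_times), replace = TRUE)
  b_clust <- sample(clusters,    size = length(clusters),    replace = TRUE)
  # Interpose resampled gaps between resampled clusters to rebuild the
  # process of threshold excesses and their times of occurrence.
  list(gaps = b_gaps, clusters = b_clust)
}
```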
The method just described is that of Ferro and Segers (2003), but it has the drawback of being very sensitive to the largest value in the data. The texmex package mitigates this by modifying step 5 of the nonparametric bootstrap above as follows.
The modified procedure is then semi-parametric rather than nonparametric: we sample from the empirical distribution of interexceedance times and from the fitted parametric distribution describing the cluster maxima.
The model adopted for inference about the extremal index and subsequent declustering arises in the limit as the threshold attains the upper end point of the distribution \(F\). Whilst this is the case, to carry out this inference the threshold must be set at a finite level, balancing opposing pulls on the value which it takes. The threshold should be high enough that the underlying distribution is well approximated by its limiting form. It should preferably also be sufficiently low that we can use observed threshold excesses to estimate the model parameters with some precision. There are two diagnostic tools which we put forward to aid the selection of a suitable threshold, both of which exploit the properties of the Ferro and Segers model in Equation 1.
The distribution of normalised interexceedance times in (1) is a mixture distribution. Having estimated \(\hat{\theta}\) by using the intervals estimator, we can check the goodness of fit of this underlying model: \(\hat{\theta}\) estimates the proportion of interexceedance times which correspond to inter-cluster times and whose normalised values should be well approximated by the Exponential distribution with mean \(\frac{1}{\hat{\theta}}\). The remaining normalised interexceedance times are intra-cluster times and should be small relative to the inter-cluster times.
Ferro and Segers (2003) propose a quantile-quantile (Q-Q) plot of observed normalised interexceedance times against standard Exponential quantiles to diagnose model fit. Its appearance should be slightly different from a standard Q-Q plot (which shows the data hugging a straight line in the case of good model fit). In our case, since the underlying limit model is a mixture of degeneracy at zero and the Exponential distribution, we look instead for a broken stick shape.
The \((1-\hat{\theta})\) quantile is indicated with a vertical line: above this line we look for the observed and theoretical quantiles hugging a straight line with gradient \(\frac{1}{\hat{\theta}}\) (also marked); below this line we look for a sudden attenuation of the observed times close to zero.
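A simplified version of this diagnostic can be sketched in base R. Under the limit model, for \(p > 1-\theta\) the quantile of \(T_{\theta}\) is \((q_p + \log\theta)/\theta\), where \(q_p\) is the standard Exponential quantile, so the upper reference line has gradient \(1/\hat{\theta}\) and intercept \(\log(\hat{\theta})/\hat{\theta}\):

```r
# Sketch of the broken-stick Q-Q diagnostic; simplified relative to the
# texmex plot methods. tnorm: normalised interexceedance times.
qq_broken_stick <- function(tnorm, theta_hat) {
  p <- ppoints(length(tnorm))
  plot(qexp(p), sort(tnorm),
       xlab = "Standard Exponential quantiles",
       ylab = "Normalised interexceedance times")
  abline(v = qexp(1 - theta_hat), lty = 2)  # the (1 - theta) quantile
  abline(a = log(theta_hat) / theta_hat,    # upper reference line:
         b = 1 / theta_hat, col = "purple") # gradient 1/theta
}
```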
We exploit another useful property of the extremal index estimator to derive a further tool for threshold selection. This property is that of threshold stability, which is used in threshold selection for many other extreme value models.
The threshold stability of the extremal index estimator refers to its invariance to change in threshold above a suitably high threshold. Simply put, once a threshold is high enough, raising the threshold further should not dramatically change the estimated value of \(\theta\). As the threshold increases, the sample size used for estimation will fall so sampling uncertainty will increase, but the estimated values of \(\hat{\theta}\) should not alter above this anticipated sampling variation.
Any apparent trend in the estimated \(\theta\) with threshold is an indication that the threshold is too low.
The threshold stability plot examines a range of thresholds for invariance of \(\hat{\theta}\) to change in threshold. At each threshold spanning a range of thresholds, the extremal index is calculated and a confidence interval estimated by using the bootstrap. These are plotted against threshold. The plot is used to choose the lowest threshold above which estimates of \(\theta\) are approximately constant.
The data are daily rainfall measurements from south-west England, observed from 1914 to 1962.
########################################
#Statistical inference for clusters of extreme values
########################################
#Daily rainfall series
ggplot(data=data.frame(rain=rain,index=1:length(rain)),aes(index,rain))+geom_point(alpha=0.5,col="black")
We first examine the threshold stability plot for the extremal index and print the parameters of the GPD fitted to cluster maxima.
########################################
#Extremal index estimation and declustering
########################################
erf<-extremalIndexRangeFit(rain,nboot=20,verbose=FALSE)
#p<-ggplot(erf)
#gridExtra::grid.arrange(p[[1]],p[[2]],p[[3]],ncol=1)
par(mfrow=c(3,1))
## [[1]]
## NULL
##
## [[2]]
## NULL
plot(erf)
The plots above show that the estimated parameters of the GPD are relatively stable across thresholds, but the estimate \(\hat{\theta}\) of the extremal index is not: \(\hat{\theta}\) decreases as the threshold is lowered, reportedly because lower thresholds produce larger numbers of clusters.
Since we usually want to learn about clustering tendencies at high levels, we choose a threshold in the region where all three estimated quantities are roughly constant. Here, a threshold of about 13mm or above seems appropriate.
Next, for thresholds of 13mm and above (13, 15, 17, 19), we draw Q-Q plots to check goodness of fit.
#goodness of fit of our model for cluster occurrence by using the Q-Q plot
ei<-extremalIndex(rain,threshold=13)
g1<-ggplot(ei)
ei<-extremalIndex(rain,threshold=15)
g2<-ggplot(ei)
ei<-extremalIndex(rain,threshold=17)
g3<-ggplot(ei)
ei<-extremalIndex(rain,threshold=19)
g4<-ggplot(ei)
gridExtra::grid.arrange(g1,g2,g3,g4,ncol=2)
These plots confirm that a threshold of about 13mm is a good choice. The estimated extremal index is then about 0.7, so the average cluster size will be about \(1/0.7 \approx 1.4\). In other words, rainfall is affected by the previous day, but the dependence lasts at most a day or two.
Next, we decluster using the automatic declustering technique.
#automatic declustering method
ei<-extremalIndex(rain,threshold=13)
ei
##
## Length of original series 17531
## Threshold 13
## Number of Threshold Exceedances 1294
## Intervals estimator of Extremal Index 0.7046188
plot(ei)
dc<-declust(ei)
dc
## declust.extremalIndex(y = ei)
##
## Threshold 13
## Declustering using the intervals method, run length 3
## Identified 817 clusters.
The declust function in texmex can also be applied directly to the data, combining the extremalIndex estimation step and the declustering step into one:
dc<-declust(rain,threshold=13)
plot(dc)
The results show that 1294 of the 17531 original observations exceed 13mm, but once serial dependence is accounted for there are only 817 independent clusters.
We now estimate the GPD parameters using the cluster maxima obtained by declustering. In texmex, the declustered series can be passed directly to the evm function.
########################################
#Fitting the Generalised Pareto distribution to cluster maxima
########################################
rain.gpd<-evm(dc)
rain.gpd
## Call: evm.declustered(y = dc)
## Family: GPD
##
## Model fit by maximum likelihood.
##
## Convergence: TRUE
## Threshold: 13
## Rate of excess: 0.0466
##
## Log. lik AIC DIC
## -2692.660 5389.319 5389.424
##
##
## Coefficients:
## Value SE
## phi: 2.31937 0.04524
## xi: -0.02353 0.02871
In this output, the rate of excess is the rate at which the cluster maxima occur in the original series. With the 13mm threshold there were 817 clusters, so the rate of occurrence is obtained as follows.
dc$nCluster
## [1] 817
length(rain)
## [1] 17531
dc$nCluster/length(rain)
## [1] 0.04660316
Now let us produce the diagnostic plots.
#We now look at the diagnostic plots to check the fit of the GPD:
ggplot(rain.gpd)
These plots reveal a few minor causes for concern.
We now compare the GPD parameter estimates obtained from the cluster maxima with those obtained without declustering (below).
#We can compare the parameter estimates of the GPD model fitted to the
#cluster maxima with those obtained by fitting the GPD to the original series
evm(rain,th=13)
## Call: evm(y = rain, th = 13)
## Family: GPD
##
## Model fit by maximum likelihood.
##
## Convergence: TRUE
## Threshold: 13
## Rate of excess: 0.07381
##
## Log. lik AIC DIC
## -4034.082 8072.164 8072.214
##
##
## Coefficients:
## Value SE
## phi: 2.10088 0.03709
## xi: 0.01682 0.02457
The point estimates do not differ greatly, while the SEs are larger for the cluster-maxima fit. The key difference between the fits is that the independence assumption underlying the fitting of both of these models is not met in the case where the GPD is fitted to all threshold exceedances, whereas the assumption that cluster maxima are independent is satisfied.
The rate of excess in the GPD fitted to the whole series refers to the rate at which the original series exceeds the threshold of 13mm. Note also that the two fits use different data, so their AIC values cannot be compared directly.
Sometimes the GPD can also be fitted by specifying a penalty, or by placing prior information on the model parameters and estimating by MCMC. For example, to simulate from the posterior distribution of the parameters, rather than the default estimation by using (penalised) maximum likelihood, we can do:
########################################
#MCMC
########################################
rain.mcmc <- evm(dc,method="simulate")
## 10000 steps taken
## 20000 steps taken
## 30000 steps taken
## 40000 steps taken
## Acceptance rate:0.356
rain.mcmc
## evm.declustered(y = dc, method = "simulate")
## Family: GPD
## Acceptance rate: 0.356
##
## MAP estimates:
## phi: xi:
## 2.3193680 -0.0235276
##
## Posterior means:
## phi: xi:
## 2.31640951 -0.01890329
The result is returned as an object of class evmSim.
Return levels can be computed from the GPD fitted to cluster maxima in the usual way. Note that the return levels computed from the declustered data refer to the occurrence of cluster maxima, rather than all threshold excesses, and need to be interpreted accordingly.
M <- seq(30,1000)
p1 <- ggplot(predict(rain.gpd,M=M,ci.fit=TRUE), main="Return level plot: cluster max")
p2 <- ggplot(predict(evm(rain,th=13),M=M,ci.fit=TRUE), main="Return level plot: original series")
breaks <- c(20,50,100,200,500,1000)
grid.arrange(p1[[1]] + scale_x_continuous(trans="log",breaks=breaks),
p2[[1]] + scale_x_continuous(trans="log",breaks=breaks), ncol=1)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
The return levels calculated for the cluster maxima and the original series do not differ enormously. Close to the threshold, excesses of a given level are rarer under the fitted cluster-max model than under the fitted original series model. This is explained by the trivial observation that cluster maxima occur at a lower rate than do threshold exceedances. Away from the fitting threshold, the two curves give very similar estimates of return levels. This reflects the relatively weak dependence in the original series, so that clusters are of short duration and the declustered series is not so different from the original one.
The procedure described above, of declustering and then fitting the GPD to cluster maxima, gives a valid statistical model whose underlying assumptions are met. In practice, however, bear in mind that obtaining cluster maxima is not necessarily the ultimate goal. For example, rainfall information is most valuable when the assessment of flood damage is the ultimate goal. Here it may be more informative to analyse complete clusters and obtain an understanding of the aggregate rainfall over a rainy spell, rather than to focus on the largest daily value over that spell. This problem is inherently more difficult and requires a much more sophisticated solution; it is not attempted here.
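As a hedged sketch of the direction such an analysis could take, runs declustering can be applied directly so that the rainfall within each spell can be summed; this sketch does not use texmex's declustering objects:

```r
# Aggregate values within runs-declustered clusters: exceedances of u that
# are separated by at most r below-threshold observations are pooled into
# one cluster, and the observed values within each cluster are summed.
# A sketch only; it does not use texmex's declustering objects.
cluster_totals <- function(x, u, r) {
  exc <- which(x > u)
  new_cluster <- c(TRUE, diff(exc) > r)  # a gap longer than r starts a new cluster
  tapply(x[exc], cumsum(new_cluster), sum)
}

cluster_totals(c(0, 5, 6, 0, 0, 0, 0, 0, 7, 0), u = 4, r = 3)
# cluster 1 totals 11 (values 5 and 6), cluster 2 totals 7
```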
A further point that deserves to be made concerns the choice of interexceedance time cut-off which divides those times into intra-cluster (short times) and inter-cluster times (long times). The above algorithm offers an automatic procedure for deciding the cut-off time, based on a statistical model whose assumptions can be examined. However, in applications where there is a valid time horizon within which threshold exceedances should be considered to arise within the same cluster, this statistical model based approach should not over-ride the science based argument in the case where the methods suggest different answers.
In cases where there is a clear a priori argument for selecting a cluster separation time, then the declustering can be carried out by using the so-called runs declustering algorithm (see Coles, 2001 [1], Chapter 5). Here we must specify the run-length r, the minimum number of consecutive values which must lie below the threshold before a cluster is deemed to be complete. In texmex we do:
declust(rain,th=13,r=5)
## declust.extremalIndex(y = ei, r = r)
##
## Threshold 13
## Declustering using the runs method, run length 5
## Identified 696 clusters.
Specifying a run length r forces the use of the runs declustering method. In contrast, when we do not specify it, the intervals method automatically fixes the run length following estimation of the extremal index:
declust(rain,th=13)
## declust.extremalIndex(y = ei, r = r)
##
## Threshold 13
## Declustering using the intervals method, run length 3
## Identified 817 clusters.
Here we consider the possibility of including covariates in the GPD model for cluster maxima. This is demonstrated by using the five-dimensional winter air pollution dataset included in texmex.
For more information on this dataset type
help(winter)
We look at the sulphur dioxide component of this series, which exhibits temporal dependence, and following declustering of this series attempt to model the excesses of a threshold by cluster maxima, using the remaining variables in the dataset as explanatory variables. The serial dependence in the \(\text{SO}_{2}\) data is apparent from a plot of the data in which the largest values appear clumped together:
ggplot(data=data.frame(Index = 1:length(winter$SO2), SO2 = winter$SO2), aes(Index,SO2)) + geom_point(colour="dark blue",alpha=0.5)
The \(\text{SO}_{2}\) variable is declustered whilst retaining the covariate structure of the data frame which holds the data. First we select a threshold:
erf <- extremalIndexRangeFit(SO2,data=winter,umin=10,umax=40,verb=FALSE)
g <- ggplot(erf)
grid.arrange(g[[1]],g[[2]],g[[3]],ncol=1)
This suggests that we should not use a threshold below 15. We further examine threshold choice by using Q-Q plots:
g1 <- ggplot(extremalIndex(SO2,data=winter,threshold=15))
g2 <- ggplot(extremalIndex(SO2,data=winter,threshold=20))
g3 <- ggplot(extremalIndex(SO2,data=winter,threshold=25))
g4 <- ggplot(extremalIndex(SO2,data=winter,threshold=30))
gridExtra::grid.arrange(g1,g2,g3,g4,ncol=2)
This suggests that a threshold of around 20 gives the best fit to the data. We proceed by using this value.
ei <- extremalIndex(SO2,data=winter,threshold=20)
ei
##
## Length of original series 532
## Threshold 20
## Number of Threshold Exceedances 192
## Intervals estimator of Extremal Index 0.5206914
The extremal index estimate of around 0.5 has the interpretation of clusters occurring with an average size of two. We can decluster using this estimate of the extremal index:
d <- declust(ei)
ggplot(d)
d
## declust.extremalIndex(y = ei)
##
## Threshold 20
## Declustering using the intervals method, run length 1
## Identified 77 clusters.
So of the 192 threshold exceedances, only 77 are independent cluster maxima.
We can now fit the GPD model to the excesses of cluster maxima above our threshold of 20. There are no covariates in the GPD model at this stage.
so2.gpd <- evm(d)
so2.gpd
## Call: evm.declustered(y = d)
## Family: GPD
##
## Model fit by maximum likelihood.
##
## Convergence: TRUE
## Threshold: 20
## Rate of excess: 0.1447
##
## Log. lik AIC DIC
## -328.5506 661.1012 661.6475
##
##
## Coefficients:
## Value SE
## phi: (Intercept) 3.0552 0.1935
## xi: (Intercept) 0.2121 0.1583
ggplot(so2.gpd)
The diagnostic plots show a small kink in the Q-Q plot. We now look into whether we can improve upon this slight lack of fit by covariate modelling. We can examine the suitability of the remaining air quality variables as explanatory variables in linear predictors for the GPD model parameters by using scatter plots:
p1 <- ggplot(data=winter,aes(O3,SO2)) + geom_point(colour="dark blue",alpha=0.5)
p2 <- ggplot(data=winter,aes(NO2,SO2)) + geom_point(colour="dark blue",alpha=0.5)
p3 <- ggplot(data=winter,aes(NO,SO2)) + geom_point(colour="dark blue",alpha=0.5)
p4 <- ggplot(data=winter,aes(PM10,SO2)) + geom_point(colour="dark blue",alpha=0.5)
grid.arrange(p1,p2,p3,p4,ncol=2)
Here we are not looking for a linear regression type relationship, but instead whether the scatter (scale) or tail behaviour (shape) of the \(\text{SO}_{2}\) variable depends on the candidate explanatory variable. The Ozone variable looks like a possible candidate, since the largest values of \(\text{SO}_{2}\) are much more scattered for small values of Ozone than for large. We will look at the fitted models for each of the candidate explanatory variables, and examine the AIC (Akaike Information Criterion) for each in turn. A model that gives a better fit to the data – beyond that which we would expect to see simply due to the addition of more model parameters – will have a reduced value of AIC. We favour models with lower AIC. The absolute value of AIC is not of interest in itself, as this is a function of the exact choice of data used to fit the model. The values of AIC for models fitted to different sets of data (for instance to threshold excesses defined by different choices of threshold) are not comparable.
Covariate models are fitted to the declustered data object by using the model formula syntax, specifying the name of the parameter to be modelled with a covariate and also the name of the column of the original data frame containing the covariate. For example, to include the covariate NO in the linear predictor for the log scale parameter, phi, we do:
evm(d,phi=~NO)
## Call: evm.declustered(y = d, phi = ~NO)
## Family: GPD
##
## Model fit by maximum likelihood.
##
## Convergence: TRUE
## Threshold: 20
## Rate of excess: 0.1447
##
## Log. lik AIC DIC
## -328.4744 662.9488 663.2072
##
##
## Coefficients:
## Value SE
## phi: (Intercept) 3.138860 0.283911
## phi: NO -0.000399 0.001013
## xi: (Intercept) 0.197205 0.160885
The model is estimated by using the treatment contrasts parameterisation, so that we estimate an intercept for the parameter phi (the value taken by this parameter for a zero value of the covariate) and also a gradient. Here the estimated gradient shows a very slight decline in phi for every positive unit change in NO (although this gradient parameter is not significantly different from zero).
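For example, using the point estimates printed above, the linear predictor for phi can be evaluated by hand at a given NO level (illustration only):

```r
# Evaluate the linear predictor for phi at a given NO value, using the
# point estimates printed above (illustration only).
phi_at_NO <- function(no) 3.138860 - 0.000399 * no
phi_at_NO(0)    # 3.13886, the intercept
phi_at_NO(100)  # 3.09896: a very slight decline
exp(phi_at_NO(100))  # the implied GPD scale parameter sigma
```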
We can examine the AIC for each of the candidate explanatory variables as follows:
AIC(evm(d,phi=~NO))
## AIC DIC
## 662.9488 663.7457
AIC(evm(d,phi=~NO2))
## AIC DIC
## 662.8852 663.2778
AIC(evm(d,phi=~O3))
## AIC DIC
## 651.8719 652.5211
AIC(evm(d,phi=~PM10))
## AIC DIC
## 662.9997 663.5600
As the scatter plots suggested, the Ozone variable gives the best model fit, which its low AIC value confirms; this is a significant improvement over the model fitted with no covariates. We now examine whether we can do better still by also including a covariate in the linear predictor for the shape parameter xi.
AIC(evm(d,phi=~O3,xi=~NO))
## AIC DIC
## 651.7845 651.8335
AIC(evm(d,phi=~O3,xi=~NO2))
## AIC DIC
## 653.2484 653.8503
AIC(evm(d,phi=~O3,xi=~O3))
## AIC DIC
## 653.6356 654.4013
AIC(evm(d,phi=~O3,xi=~PM10))
## AIC DIC
## 652.9210 653.5952
So we do not gain further by including any of these variables in an expression for xi. We reach the same conclusion (working not shown here) if we examine the inclusion of covariates only in the linear predictor for xi, and not in the (log) scale parameter.
Finally we look at the model diagnostics for our preferred covariate model:
so2.o3.gpd <- evm(d,phi=~O3)
ggplot(so2.o3.gpd)
So we have managed to get rid of the previous kink in the Q-Q plot which appeared for the model when fitted with no covariates. We can also plot the residuals from each of our models against the O3 covariate:
O3 <- winter[winter$SO2>d$threshold,"O3"][d$isClusterMax]
p1 <- ggplot(data=data.frame(O3=O3, Residuals = resid(so2.gpd)), aes(O3,Residuals)) + geom_point(colour="dark blue",alpha=0.5) + labs(title="No covariate model") + geom_smooth()
p2 <- ggplot(data=data.frame(O3=O3, Residuals = resid(so2.o3.gpd)), aes(O3,Residuals)) + geom_point(colour="dark blue",alpha=0.5) + labs(title="O3 covariate model") + geom_smooth()
grid.arrange(p1,p2,ncol=1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
These plots show that the covariate model captures the dependence on the Ozone variable which is visible in the residual scatter of the no-covariate model.
Now that our preferred model for \(\text{SO}_{2}\) has model parameters specified as a function of a covariate – O3 – we have to specify a value of \(\text{O}_{3}\) in order to calculate a return level for \(\text{SO}_{2}\).
For instance, to look at \(\text{O}_{3}\) values of 10, 15 and 20 and predicted return levels for each of these environmental conditions, we can do:
newdata <- data.frame(O3=c(10,15,20))
so2.o3.rl <- predict(so2.o3.gpd,M=seq(100,1000,len=100),ci.fit=TRUE, newdata=newdata)
g <- ggplot(so2.o3.rl)
breaks <- c(100,200,500,1000)
grid.arrange(g[[1]] + scale_x_continuous(trans="log",breaks=breaks) + scale_y_continuous(limits=c(0,350)), g[[2]] + scale_x_continuous(trans="log",breaks=breaks) + scale_y_continuous(limits=c(0,350)), g[[3]] + scale_x_continuous(trans="log",breaks=breaks) + scale_y_continuous(limits=c(0,350)),ncol=3)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
The unit of time to which the return period now refers is the cluster, since the GPD model has been fitted to cluster maxima.
The interpretation of GPD models fitted with covariates included in the linear predictors for model parameters requires some delicacy. We put this in context by recalling the general approach taken within the statistical extreme value modelling paradigm. All extreme value statistical models are motivated by the limiting behaviour of random variables as we look at more and more extreme values of these variables. These limiting arguments provide statisticians with some theoretical justification for using these models in regions of the sample space where data are scarce and unreliable for model validation; this is particularly helpful when we are required to extrapolate from our models, beyond levels that have been seen in the data for model fitting. We exploit stability properties of these models that arise as a consequence of the limiting arguments from which they are born.
These useful model properties and the interpretation of model parameters are somewhat muddied when we use covariates in the linear predictors for the parameters of the extreme value models. As for any regression type model, the models are conditional on the observed values of the covariate. This is satisfactory when the covariate can be fixed by design, and it makes sense to think of holding its value constant while we extrapolate the response variable. In the context of our winter air pollution example however, the Ozone variable was not set at a design value, but observed jointly with the other air quality variables. It is not at all clear from this setting whether it is appropriate to think about holding the value of Ozone fixed while examining ever higher values of \(\text{SO}_{2}\).
The issue of threshold selection also requires further consideration in the context of covariate modelling. For our simple example above, we used a constant threshold for the \(\text{SO}_{2}\) declustering and subsequent modelling of cluster maxima. The validity of this modelling practice for non-stationary processes is rather questionable. Our approach above asserts that the probability of threshold excess is not affected by the covariate value. This is a very strong assumption and has been called into doubt in the literature for applications of this type. A possible solution is to use a variable threshold, possibly also depending on the covariate.
Dey, Dipak K., and Jun Yan. 2015. Extreme Value Modeling and Risk Analysis. Taylor & Francis Inc. https://www.ebook.de/de/product/24179421/extreme_value_modeling_and_risk_analysis.html.
Ferro, Christopher A. T., and Johan Segers. 2003. “Inference for Clusters of Extreme Values.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65 (2): 545–56. https://doi.org/10.1111/1467-9868.00401.
Leadbetter, M. R. 1983. “Extremes and Local Dependence in Stationary Sequences.” Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 65 (2): 291–306. https://doi.org/10.1007/bf00532484.